
add(fluentd,fluent-bit): added lifecycle object #185

Merged
2 commits merged into fluent:main on Nov 29, 2021

Conversation

applike-ss
Contributor

This PR adds the option to set a lifecycle object on the pods. That can be helpful to keep the pod running long enough for it to be properly deregistered from a load balancer.

Signed-off-by: Sebastian Struß <struss@justdice.io>
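
For illustration, a minimal sketch of how such a values entry could look once the chart exposes it; the key name follows the PR title, but the exact shape and the availability of a shell in the image are assumptions, not the chart's documented API:

```yaml
# values.yaml sketch: keep the pod around briefly after it is asked to stop,
# so an external load balancer has time to deregister the target.
# Assumes the container image ships /bin/sh, which may not hold for distroless images.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 20"]
```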
@stevehipwell
Collaborator

@applike-ss do you have a particular use case for this?

@applike-ss
Contributor Author

Yes, as I stated: we want to give the load balancer enough time to deregister the target and not let fluentd exit right away. As new tasks take some time to become healthy, we would lose logs for inputs that are not retried.

@applike-ss
Contributor Author

@stevehipwell mind merging it? Or are there open questions?

@stevehipwell
Collaborator

@applike-ss I am still struggling to see why you need a lifecycle hook instead of a customisable termination grace period. Is there something non-standard that Fluent Bit does with SIGTERM?

@applike-ss
Contributor Author

The termination grace period (if I recall correctly) tells Kubernetes how long fluentd/fluent-bit is allowed to take to shut down. However, our problem is NOT that it shuts down slowly (the opposite is the case).
Our problem is that we want the following sequence (sketched below):

  • cut all connections to fluentd by deregistering the target completely
  • shut down fluentd
  • start the new fluentd
  • register it in the target group
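
For illustration only (this is not part of the PR diff), a sketch of how a preStop hook and the grace period interact at the pod-spec level; the grace-period countdown includes the time spent in preStop, so it has to cover both the sleep and the actual shutdown:

```yaml
# Pod spec sketch with illustrative names and numbers (assumptions, not from the charts):
spec:
  terminationGracePeriodSeconds: 60   # must cover preStop sleep + shutdown after SIGTERM
  containers:
    - name: fluentd
      lifecycle:
        preStop:
          exec:
            # Keep serving while the external load balancer deregisters the target;
            # SIGTERM is only sent to the container once this hook has finished.
            command: ["/bin/sh", "-c", "sleep 30"]
```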

@stevehipwell
Collaborator

@applike-ss I'm struggling to see why this doesn't already work when the correct K8s configuration is provided (update strategy, PDB, termination grace period). See Kubernetes best practices: terminating with grace.

@applike-ss
Contributor Author

Ahhh, now I see where the confusion is coming from. When I was talking about the load balancer, I wasn't thinking of the Kubernetes-internal load balancing, but rather the load balancer of our cloud provider. With the current setup, requests still go to a pod that is no longer alive when there is a deployment. That is because fluentd's shutdown is quite fast, but the load balancer only detects it after 2 failed health checks, which have a ~10 second interval.

@stevehipwell
Collaborator

@applike-ss technically this should still work, but I'm aware of a number of cases where there can be problems. I assume you've been seeing issues with this already? Which cloud and LB implementation are you using?

@applike-ss
Contributor Author


Yes, we have seen issues there. We are using an AWS NLB. It requires a minimum of 2 consecutive healthy/unhealthy health check results before changing a target's status, and the timeout needs to be at least 10 seconds there, so that would potentially be ~20 seconds of no logs being ingested into the pipeline for clients that do not retry.
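
Spelling out the arithmetic: 2 consecutive failed checks at a ~10 second interval is roughly a 20 second detection window, so the preStop sleep from the earlier sketch would need to be at least that long. A sizing sketch (the numbers are assumptions, not taken from this PR):

```yaml
# 2 checks x ~10 s interval ≈ 20 s until the NLB marks the target unhealthy,
# so sleep slightly longer than that window before letting the pod exit.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 25"]
```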

@stevehipwell
Collaborator

@applike-ss thanks for the details, this makes sense (kubernetes-sigs/aws-load-balancer-controller#2366 has some more details).

If you're still using the legacy in-tree controller or instance mode, you might want to look at the new AWS Load Balancer Controller and IP mode, as this removes a lot of the legacy issues.
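
For reference, a sketch of what pointing a Service at the AWS Load Balancer Controller with IP targets typically looks like; the annotation names are from the controller's v2.x docs, while the service name, selector and port are placeholders rather than anything taken from these charts:

```yaml
# Service sketch (assumptions based on AWS Load Balancer Controller v2.x):
apiVersion: v1
kind: Service
metadata:
  name: fluentd
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"       # hand the NLB to the controller
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"  # register pod IPs, not instances
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: fluentd
  ports:
    - name: forward
      port: 24224
      targetPort: 24224
      protocol: TCP
```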

@stevehipwell
Collaborator

I've got a minor suggested change and then I'd be happy to merge.

Review comment on charts/fluent-bit/templates/_pod.tpl (outdated, resolved)
Signed-off-by: Sebastian Struß <struss@justdice.io>
@applike-ss
Contributor Author

In fact, I recently switched to using IP mode. Prior to that (months ago) I read an article by an AWS engineer who recommended adding a preStop lifecycle hook, but that's so long ago that my browser history won't find it anymore.

I implemented your suggested change in both charts. Note, though, that the other properties are handled the same way I did it originally (welcome to copy-paste hell).
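
The exact template change lives in the resolved review thread above; as a rough sketch (not the merged diff), the pattern these charts use for optional blocks like this typically looks like:

```yaml
{{- /* _pod.tpl sketch: only render the block when .Values.lifecycle is set.
       The nindent value depends on where the block sits in the template. */}}
{{- with .Values.lifecycle }}
lifecycle:
  {{- toYaml . | nindent 2 }}
{{- end }}
```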

@stevehipwell
Collaborator

/lgtm

@stevehipwell merged commit e7e56cd into fluent:main on Nov 29, 2021